1 Introduction

1.1 Problem Definition: Understanding People And Data

With the breakthroughs in data science and machine learning, the most popular platforms such as Spotify, Amazon, and YouTube use recommendation systems to personalize their users’ experience, which offers significant value to both consumers and firms. For consumers, these systems reduce the effort of searching for suitable products among a large number of options. Their effectiveness is illustrated by the fact that 60% of choices on Netflix and 35% of sales on Amazon originate from recommendations (Lee et al., 2012). For firms, these systems promote sales and build trust with users. Personalisation is reported to play an important role in the success of top e-commerce companies. Inspired by these benefits, this quantified-self (QS) project harnesses personal data to gain insights into people’s behaviors and preferences, which could be used to develop or improve such systems.

There are different factors determining which songs users listen to or which products they purchase. For example, as humans, we all go through ups and downs in life, which change our moods on a daily basis. These moods can have a significant impact on behavior such as listening to music or purchasing goods. Besides, people of the same age or gender can share common tastes in music or spending habits. Hence, our group collected and analysed three datasets (songs, expenses, and moods) to explore:

  • Patterns of song listening.
  • Patterns of expenses.
  • Relationship between mood and songs/expenses.

The findings are used to examine the above assumptions and to check against findings from other research.

Apart from the data collected by the group, an individual dataset was also collected and analysed to find the relationship between users’ moods and features of the songs they listen to, such as danceability, energy, etc.

1.2 Data Understanding and Investigation

1.2.1 Group data

Rationale:

  • While automated tools require members to have certain technical skills, manual logging is easy and needs no training.
  • We agreed to score our mood once a day and to collect 3 songs/day, which is straightforward and fast. Therefore, running a tool or app for tracking is not really necessary in this case.
  • Avoid subjectivity: Whether a song is happy or sad is a matter of personal perspective. Therefore, our team agreed to use R/Python libraries to analyse the sentiment of songs via their lyrics, which provides objective outputs.
  • For expenses, it is preferable to download a transaction file from the bank account. This passive approach ensures that members do not miss any transaction in a day.
  • Although manual logging is not recommended, it is the only option for members who do not have a bank card (they had only just opened a bank account in Sydney and card delivery was delayed).
  • Data consistency: All members get lyrics from the same source, so the lyrics have consistent content and format. For example, in all lyrics returned by the Genius API, burning is burnin’, running is runnin’, etc. As a result, each member can use their own code to process lyrics for the entire group dataset.
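Because the lyrics share one format, a single normalisation rule can be applied to every member's songs. The snippet below is only a sketch of such a rule, not the code the group used: the function name `normalise_lyrics` and the regular expression are assumptions made for illustration.

```python
import re

# Assumed rule: a word ending in "in" followed by an apostrophe (straight or curly)
# is a dropped-g form such as "burnin'" or "runnin’". The lookahead keeps words
# like "ain't" untouched because a letter follows the apostrophe.
_DROPPED_G = re.compile(r"\b(\w+in)['’](?!\w)")


def normalise_lyrics(text: str) -> str:
    """Lower-case the lyrics and restore the dropped final 'g'."""
    return _DROPPED_G.sub(r"\1g", text.lower())


print(normalise_lyrics("Burnin' bright, runnin’ wild"))
# burning bright, running wild
```

A rule like this matters for lexicon-based scoring, since a lexicon would typically list "burning" rather than "burnin’".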

Issues:

  • Human error: Data is prone to error because anyone could make a mistake when copying information from external sources and pasting it into our Google Sheet.
  • Missing data: There might be dozens of expenses a day, so recording all of them manually is arduous and inefficient, and some transactions might be missed.
  • Sentiment analysis accuracy: Using tools to score the sentiment of each song gives a consistent measure, which facilitates comparison and data merging. However, we have to accept a trade-off in accuracy: a song can have different meanings to different people (it may be positive to one but negative to others).
  • Hawthorne effect: The Hawthorne effect is a type of reactivity in which individuals modify an aspect of their behavior in response to their awareness of being observed (Schwartz et al., 2013). Members may intentionally manipulate their data.
  • There is a concern about the accuracy of sentiment analysis libraries and lexicons such as nltk, bing, afinn, etc.

1.2.2 Individual Data

Rationale:

  • Ease of collection: By using a scheduler, a larger number of songs can be collected with ease.
  • Avoid human error: No manual steps are needed. All steps, including collecting the list of songs, randomly picking 5 songs, getting song features, and storing them in an Excel file, are done automatically by a Python program, which gives better data quality.
  • Reduce Hawthorne effect: Being aware of being observed, a subject can manipulate the research results as they want. To mitigate this (though not entirely), the automated collection picks 5 songs from a list of songs so that the subject does not know which songs will be collected for research.
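The random-pick step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the program used in the project: `pick_songs` and `played_today` are hypothetical names, the input is assumed to be a plain list of titles, and dropping repeated plays before sampling is a choice made here, not something the report specifies.

```python
import random


def pick_songs(played, k=5, seed=None):
    """Return up to k distinct songs chosen uniformly at random."""
    rng = random.Random(seed)             # a seed makes a run repeatable for checking
    unique = list(dict.fromkeys(played))  # drop repeated plays, keep first-seen order
    return rng.sample(unique, min(k, len(unique)))


# Hypothetical listening history for one day (titles taken from Appendix A).
played_today = [
    "Dive", "Pepas", "Mariposa", "Higher Love",
    "Dive", "My Head & My Heart", "When I Was Your Man",
]
print(pick_songs(played_today, k=5))
```

Because the selection happens only when the scheduled job runs, the listener cannot know in advance which songs will enter the dataset.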

Issue:

  • Despite the above significant advantages, building a Python application like this takes a long time (depending on the expertise of the researcher). Therefore, it is only recommended for people who are familiar with coding and tools.

1.2.3 External Cohort Data

The research also analysed cohort data. Two cohort datasets were collected: the average weekly expenditure of a lone person under 35 in NSW (Australian Bureau of Statistics, 2015-16) and music preference peak by age (Stephens-Davidowitz, n.d.). These are considered reliable because they come from the Australian Bureau of Statistics and from an analysis of Spotify data. The former is compared against findings in the Group expense data, while the latter is compared against the analysis of the Group song data.

1.3 Preparing the Data

Data quality is assessed with reference to The Quartz guide to bad data (Yanofsky, 2018). Several issues were identified in the datasets, which can be classified by their level of impact.

Resolved issues thereby having no impact on quality:

  • Data are missing you know should be there: In the Group expense data, there are some dates on which people did not have any expenses, which makes the date column non-sequential (72/511 records). Because this can break the data visualization and analysis results, the missing dates are added with amount 0 and expense type No spend.
  • Spelling is inconsistent: In the Group expense data, the Transportation expense type has a duplicate value Transport (2/512 records, Appendix D), which can cause inaccurate analysis results. Therefore, all Transport values are converted to Transportation.
  • Outlier: There are some outliers in the dataset that are caused by human error. In the Group song data, some dates are not between 8/2022 and 9/2022 (2/707 records, Appendix E) and one release date is after 2022 (1/707 records, Appendix F).
  • Time format is inconsistent: In the Group song data, some duration values are in HH:MM:SS format while others are in MM:SS format, which hinders data processing. Google Sheets is used to reformat the duration column.
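The fixes listed above were made in the spreadsheet, but the same steps could be scripted. The following is a rough sketch, assuming each expense row is a `(date, amount, type)` tuple for a single member; the names `clean_expenses`, `duration_to_seconds`, and `TYPE_FIXES` are hypothetical.

```python
from datetime import date, timedelta

# Assumed mapping of misspelt labels to the canonical expense type.
TYPE_FIXES = {"Transport": "Transportation"}


def clean_expenses(rows):
    """Fix type spelling and fill date gaps for ONE member's non-empty rows."""
    fixed = [(d, amount, TYPE_FIXES.get(t, t)) for d, amount, t in rows]
    seen = {d for d, _, _ in fixed}
    day, last = min(seen), max(seen)
    while day <= last:
        if day not in seen:
            fixed.append((day, 0.0, "No spend"))  # placeholder for a day with no purchase
        day += timedelta(days=1)
    return sorted(fixed, key=lambda row: row[0])


def duration_to_seconds(text):
    """Parse 'MM:SS' or 'HH:MM:SS' into a number of seconds."""
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


rows = [(date(2022, 8, 15), 15.0, "Utilities"), (date(2022, 8, 18), 20.0, "Transport")]
for row in clean_expenses(rows):
    print(row)
print(duration_to_seconds("03:27"), duration_to_seconds("00:03:27"))
```

The parser assumes that a three-part value really is hours, minutes, and seconds. A value such as 05:37:00 that was meant as 5 minutes 37 seconds would need a different rule, so the convention should be checked before converting.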

Minimal impact:

  • The numbers of females and males are not equal (4 males and 2 females), which can result in unfair comparison. To minimize this effect, the summed amount for each gender is not used for comparison. Measures such as averages or proportions are used when grouping amounts by gender.
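To see why totals mislead when group sizes differ, consider the toy example below. The member labels and amounts are invented for illustration only, and `spend_by_gender` is a hypothetical helper.

```python
from collections import defaultdict


def spend_by_gender(rows):
    """rows: (member, gender, amount). Returns {gender: (total, mean per member)}."""
    totals = defaultdict(float)
    members = defaultdict(set)
    for member, gender, amount in rows:
        totals[gender] += amount
        members[gender].add(member)
    return {g: (totals[g], totals[g] / len(members[g])) for g in totals}


# Invented data: 4 males spending 100 each, 2 females spending 150 each.
rows = [
    ("M1", "Male", 100.0), ("M2", "Male", 100.0),
    ("M3", "Male", 100.0), ("M4", "Male", 100.0),
    ("F1", "Female", 150.0), ("F2", "Female", 150.0),
]
print(spend_by_gender(rows))
# {'Male': (400.0, 100.0), 'Female': (300.0, 150.0)}
```

The male total is higher only because there are more males; the per-member mean shows the opposite ordering.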

Serious impact:

  • In the Group song data, there are some instrumental pieces that do not have lyrics, which are all scored 0 by the sentiment analysis. However, despite having no lyrics, music still expresses sentiment through its tempo, loudness, keys, chords, etc. Hence, Felix’s songs are not included in the song sentiment analysis.
  • Limited number of songs: According to statistics from Ferjan (2022), an average person listens to about 53 songs/day. As a result, calculating the overall sentiment score by date from three songs is likely to be inaccurate: the result is wrong if the three chosen songs are negative while the remaining 50 songs are positive.
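The worst case described in the last bullet can be worked through numerically. In this sketch each song is given an assumed score of -1 (negative) or +1 (positive); the scores are illustrative and not taken from the dataset.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


logged = [-1, -1, -1]   # the three songs that were recorded, all negative
unlogged = [1] * 50     # the other 50 songs of the day, all positive

score_from_log = mean(logged)             # what a 3-song daily score would report
score_full_day = mean(logged + unlogged)  # (50 - 3) / 53 over all 53 songs

print(score_from_log, round(score_full_day, 3))
# -1.0 0.887
```

The logged songs suggest a fully negative day, while the full listening history would be strongly positive.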

Inestimable impact:

  • The sentiment analysis is based on unigrams - the sentiment is calculated by adding up the score/category assigned to each word of the text. The accuracy of this approach is questioned for several reasons. First, there is a need to determine an appropriate size of text to analyze, because positive and negative sentiments can average out to about zero in a large text (Silge et al., 2022). Second, using unigrams breaks the structure of sentences and loses the context of words. Third, this approach might struggle with irony or sarcasm, in which people use positive words to express negativity. In one study, the performance of regular sentiment systems on sarcastic tweets dropped by 25%-70% (Weitzel et al., 2016). Other accuracy concerns of sentiment analysis, such as colloquialisms and domain-specific language, are also discussed by Mohammad (2017) and Roldós (2020).
  • Another cause of inaccuracy might be the self-awareness involved in self-reporting. In a study of the Hawthorne effect by Schwartz et al. (2013), monthly electricity usage was reduced by 2.7% when residential consumers received weekly postcards notifying them about their participation in a study. Similarly, awareness of being observed can result in people reducing their expenses or listening to songs aligned with their moods.
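The unigram approach and its weaknesses can be made concrete with a toy scorer. The sketch below does not use the project's lexicons: `LEXICON` is a tiny made-up word list with scores loosely in the style of AFINN, and the function names are hypothetical.

```python
import re

# Tiny illustrative lexicon; the words and scores are assumptions for this example.
LEXICON = {"love": 3, "great": 3, "happy": 3, "sad": -2, "hate": -3, "broken": -1}


def unigram_score(text):
    """Sum the lexicon score of every word; words not in the lexicon count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def label(score):
    """Map a numeric score to a sentiment category."""
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


for lyric in [
    "I love this happy song",              # plainly positive
    "Oh great, I just love being broken",  # sarcastic, yet positive words dominate
    "I love you, I hate you",              # opposite words cancel out
    "",                                    # instrumental piece: no lyrics at all
]:
    print(repr(lyric), unigram_score(lyric), label(unigram_score(lyric)))
```

With this lexicon the sarcastic line scores 5 and is labelled positive, the mixed line cancels to 0, and the empty lyric also scores 0, which mirrors the concerns about sarcasm, averaging out, and instrumental music.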

2 Analysis

For the Group data, the Song and Expense datasets are analysed separately, and then each of them is combined with the Moods dataset to find relationships. For the Individual data, the Song dataset with Spotify variables is combined with the Moods dataset to gain deeper insight into the relation between songs and moods and to compare with the analysis results from the group data.

2.1 Insights into Songs - Group

According to Davies et al. (2022), although music preference peaks vary, a large number of individuals most prefer music released in their mid-to-late teens (15 - 19) and least prefer music released earlier or later in their lives. However, Figure 2.1 indicates a different trend - only 25% of the songs that members listen to were released when they were in their teens, while 75% were released after they had reached their twenties. Besides, in most cases, the range of release dates widens as age increases, which can be explained by the fact that older people also listen to songs released earlier. Moreover, some outliers indicate that people still listen to songs released very early, even before they were born.

Figure 2.1: Age vs Release date.
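The comparison behind Figure 2.1 rests on the listener's age when each song was released. A minimal sketch of that calculation is given below, assuming year-level precision and 2022 as the listening year; `age_at_release` and `share_in_range` are hypothetical helpers, and the release years are the eight sample rows shown in Appendix A for a 30-year-old member.

```python
def age_at_release(age_now, release_year, listen_year=2022):
    """Approximate listener age in the song's release year (negative = not yet born)."""
    return age_now - (listen_year - release_year)


def share_in_range(ages, low=15, high=19):
    """Fraction of songs released while the listener was between low and high."""
    return sum(low <= a <= high for a in ages) / len(ages)


release_years = [2020, 2014, 2018, 2017, 2015, 2018, 2020, 1999]
ages = [age_at_release(30, year) for year in release_years]

print(ages)
# [28, 22, 26, 25, 23, 26, 28, 7]
print(share_in_range(ages))
# 0.0
```

In this small sample none of the songs falls in the 15 - 19 window, which is in line with the trend the figure describes.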

Figure 2.2 reveals that females have a narrower range of song duration (around 145 - 309 minutes), and outliers show that their maximum song length is just 377 minutes. Meanwhile, males have a broader range of duration (around 85 - 367 minutes) and also listen to songs with long durations (400 - 600 minutes).

Figure 2.2: Distribution of song duration.

Based on Figure 2.3, younger members listen to more song genres than older members. However, a common trend can be seen in the figure - each member’s favorite genre is either Pop or Indie, except for Felix, who prefers Newage most. This can be explained by the fact that all members were born between 1990 and 1999, an era of Pop and Indie.

Figure 2.3: Genre popularity by age.

Statistical analysis conducted by Xu et al. (2021) found that males prefer sad songs more than females do. Similarly, Figure 2.5 illustrates that women listen to positive songs more than negative songs, while the opposite holds for men. However, Figure 2.4 reveals that the song sentiment of both males and females fluctuated within the month.

Figure 2.4: Song sentiment score by gender.

Figure 2.5: Number of positive songs and negative songs.

2.2 Insights into Expenses - Group

Figure 2.6 shows members’ expenses over a month. It reveals that although members spent varied amounts, their expenses followed a certain seasonality, which is shown by significant increases in spending at specific points in time. Thyme tended to spend more at the start of every week. Phoenix spent more every Friday. Javier and Felix spent more in the middle of every week. Other members spent more from Friday to Monday each week.

Figure 2.6: Expense amount during a month.

In Figure 2.7, the amount spent per purchase was normally less than $100 for all members. However, 25% of Phoenix’s purchases were more than $200. This can be explained by the fact that Phoenix first arrived in Australia on Aug 11, while the others had already lived in Australia for at least 1 month. Besides, outliers indicate that Felix, Phoenix and Thyme made several large purchases ($390 - $1100). This suggests that members with full-time/part-time jobs are likely to spend larger amounts of money.

Figure 2.7: Distribution of expense amount.

Figure 2.8 reveals the percentage of expenses for each type by gender. It can be seen that males and females share a common spending behavior. The expense types, from most to least spent, are: Groceries -> Utilities -> Transportation -> Medical.

Figure 2.8: Percentage of expense type by gender.

Figure 2.9 shows the frequency of purchase for each expense type. It suggests that all members purchased Groceries most frequently. Besides, while four out of six members spent on Utilities second most frequently, the others (Eren and Javier Pena) spent on it least frequently. This somewhat contradicts the conclusion from Figure 2.8, which may result from the higher price of Utilities compared to Groceries.

Figure 2.9: Frequency of purchase for each type.

2.3 Insights into Moods - Group

Figure 2.10 illustrates changes in the moods of all members during a month. First, there is no common pattern in mood changes among females or among males. However, Phoenix and Felix share a common pattern, which might be because their moods were affected by assignment due dates (low moods a few days before an assignment was due and high moods afterwards). Other members’ moods are more stable.

Figure 2.10: Mood vs date

Figure 2.11 displays how females listened to songs according to their moods. It can be concluded that females prefer positive songs over other types regardless of their moods, especially on Blissful life days. Furthermore, the lower their moods were, the more sad songs they listened to.

Figure 2.11: Percentage of each sentiment by mood type - Female.

On the contrary, Figure 2.12 suggests that males prefer negative songs over other types no matter how they feel. Overall, there is no considerable change in the number of negative, neutral and sad songs they listened to when their moods changed. Moreover, it can be seen that males listened to more neutral songs than females.

Figure 2.12: Percentage of each sentiment by mood type - Male.

It has always been assumed that the sentiment of the songs people listen to is affected by their emotions. However, Figure 2.13 demonstrates that song sentiment and a listener’s mood are not positively correlated. Besides, according to Mind (2022), people tend to overspend to feel better, but Figure 2.13 indicates a low correlation between expense amount and mood.

Figure 2.13: Correlation between mood and song/expense.
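The report does not state which correlation coefficient underlies Figure 2.13. Assuming it is the Pearson coefficient, the calculation can be sketched as below; the mood values are the first six rows of the Moods sample in Appendix A, while the sentiment scores paired with them are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


mood = [6, 6, 7, 8, 6, 7]                    # daily mood scores (Appendix A sample)
sentiment = [0.2, -0.1, 0.4, 0.1, 0.3, 0.0]  # hypothetical daily song sentiment
print(round(pearson(mood, sentiment), 2))
```

A value near 0 indicates little linear relation, while values near 1 or -1 indicate a strong positive or negative relation.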

2.4 Insights into song variables - Individual

Figure 2.14 and Figure 2.15 reveal that mood and song valence are significantly correlated. Because Spotify valence is a measure of song sentiment, this result differs from the finding in Figure 2.13, which shows only a slight correlation - 0.3. There are two possible reasons for this: 1. either valence or the sentiment analysis is inaccurate; 2. the songs collected per day do not represent the listener’s mood. Besides, there is only a moderate relation between mood and danceability/energy and no relation between mood and loudness. This suggests that other song features such as danceability and energy also relate to user moods. These variables are particularly helpful for songs without lyrics.


Figure 2.14: Correlation between song variables and moods.

Figure 2.15: Moods vs positively correlated variables.

2.5 Insights into External cohort data

Research analysing Spotify data found that men are aged 13-16 when their favorite song is released, and women are aged 11-14. The research concluded that, regardless of gender, people are likely to stick to the music they listened to in the earliest phase of adolescence. This finding, however, differs from what we found in our project.


Figure 2.16: Music preference peak vs age.

Figure 2.17 presents the average weekly expenditure of a lone person under 35 in New South Wales. Unlike the finding in Figure 2.8, it shows that people in NSW spent most on Utilities instead of Groceries. However, Transportation and Medical remain the second-least and least spent types.

Figure 2.17: Average weekly expenditure of a lone person under 35 in NSW.

3 Evaluation, Discussion, and Conclusions

The intention of this project is to find patterns in song listening, spending, and moods, as well as the relations between them. Therefore, various types of stakeholders might be involved, including e-commerce companies, music streaming companies, etc. In this section, different frameworks are applied to identify current issues as well as potential issues that can arise when the project is conducted at a larger scale.

3.1 Legalities and Privacy

The Australian Privacy Principles (APP) are applied to evaluate the Legalities and Privacy aspect of this project. The APP regulate how personal data should be handled, whether or not the data is accurate. Personal data also includes personal tastes, preferences, transaction history, music listening history, etc. Therefore, the data used in this project is of a kind covered by the Privacy Act. However, the project itself does not fall under the APP because it was conducted by a small group, not a business with over $3M revenue.

Imagining that the project is conducted at a larger scale, its compliance with the APP is checked in detail in Appendix B. In summary, the project complies fully with 11/13 principles and partially with 2/13 principles. First, the project does not fully comply with APP 1 because the team did not take the APP into account throughout the project and therefore did not establish transparent practices, procedures, and systems to explicitly ensure compliance with the other APPs. However, the team did consider legal and privacy issues and therefore applied several practices that align with the other APPs. Second, the project does not fully comply with APP 11 due to the risk of data loss and unauthorised modification: there is only one Google Sheet file collecting all data points, and a member can intentionally or accidentally modify others’ records. Lastly, it should be noted that the project does not violate APP 7 because it uses a de-identified dataset that does not contain any emails, phone numbers, or addresses. Besides, the stakeholders are expected to recommend their products via their users’ web pages instead of through direct communication via SMS, telephone, email, etc.

Several strategies could have been used to address the above shortcomings. First, the team should have investigated the legalities and then selected specific frameworks, legal cases, etc. to apply throughout the project. For example, to comply with APP 1, the team needs to agree on using the APP as a legal guideline and then establish a clear process for maintaining compliance with the APPs. Second, to prevent data loss, there should have been several backups, stored on different secured platforms such as Drive, Dropbox, AWS, or Azure. This practice prevents data loss due to unexpected infrastructure failure at any one repository. Third, unauthorised modification could have been prevented by making the backups read-only: team members are only allowed to edit one dataset for data correction purposes, and all other copies are read-only.
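The read-only backup idea can be illustrated on a local file system. This is only a sketch under stated assumptions: it copies an exported data file into a backup folder and removes write permission, whereas the cloud platforms named above each have their own permission settings that are not shown here. The function name `make_readonly_backup` and the file names are hypothetical.

```python
import os
import shutil
import stat
from datetime import datetime
from pathlib import Path


def make_readonly_backup(source, backup_dir):
    """Copy source into backup_dir under a timestamped name and mark it read-only."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy the content together with file metadata
    # Leave only the read bits set so the copy cannot be edited by accident.
    os.chmod(target, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target


# Example (hypothetical paths):
# make_readonly_backup("group_data.csv", "backups")
```

The working copy stays editable for corrections, while each timestamped backup preserves the state of the data at the time it was taken.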

The project also uses data from third parties, including Commonwealth Bank (transaction data) and Spotify (song data). There has always been controversy over how applications present their terms and conditions. Citizens Advice reports that only 1% of people read terms and conditions fully, which raises a significant concern about privacy (Ketchell, 2016). Wordy and unclear documents prevent users from being fully informed of their contract with providers. As a result, they do not know how their data is controlled or traded.

3.2 Ethics

Not everything that is legal is ethical. Unlike legal principles, data ethics are established to evaluate the potentially adverse impact of data practices on people and society, thereby defining good practices for collecting, analysing, and sharing data. The Open Data Institute’s Data Ethics Canvas (Appendix C) is the framework used to evaluate the ethical aspects of this project.

Apart from the ethical considerations in the ODI canvas, there are also several ethical issues around using users’ emotions for recommendation. In Jan 2021, Spotify was granted a patent to use speech recognition to identify users’ feelings as a way to recommend songs. However, this innovation was strongly opposed by nearly 200 musicians, bands, and civil rights groups due to concerns about monitoring people’s emotions or, even worse, manipulating them (Schwartz, 2021). With the power of AI algorithms, detecting or altering human emotions via songs is no longer fiction. Besides Spotify, other companies such as Amazon, Toyota, Cerene, etc. also hold patents for similar functionality. Additionally, without a public and transparent explanation of the technology, we do not know what kind of “voice” it captures. Is it just a sound or a normal conversation? If it is the latter, these companies are likely to violate privacy principles.

3.3 Limitations, Future Work, Reflection

During the project, I encountered several difficulties.

First, as mentioned above, due to the lack of song lyrics, Felix’s songs are removed from the sentiment analysis. However, this removal is not an optimal solution because the data is still valuable and the findings could have been different had it been included. This situation is also likely to arise in wider data science practice because music can be either instrumental or lyrical. The findings in Section 2.4 indicated a strong relation between mood and Spotify variables. Therefore, these variables could replace lyrics in sentiment analysis, which avoids unnecessary data exclusion.

Second, the project should have been assessed using more legal, privacy and ethical frameworks, especially if it were conducted by an organisation. For example, if the stakeholder is a multinational company that collects data on citizens around the world, the laws of different countries must also be considered.

Third, there is a concern about inaccurate findings due to a lack of data. All data science projects need a large enough sample to ensure reliable and precise results. In this project, the lack of data has two main causes: the short collection period and manual logging. In a larger project, data collection should continue until it reaches a target number of data points. Besides, automated data collection is recommended because it makes collecting large amounts of data easy. In wider data science projects, technologies such as Snowflake, Spark, Hadoop, etc. are commonly used for data management and processing. Tools such as Excel or Drive are not preferable for such projects.

4 References

Lee, D., & Hosanagar, K. (2019). How Do Recommender Systems Affect Sales Diversity? A Cross-Category Investigation via Randomized Field Experiment. Information Systems Research, 30(1), 239–259. https://doi.org/10.1287/isre.2018.0800

Stassen, M. (2021, January 27). Spotify’s Latest Invention Monitors Your Speech, Determines Your Emotional State… And Suggests Music Based On It. Music Business Worldwide. https://www.musicbusinessworldwide.com/spotifys-latest-invention-will-determine-your-emotional-state-from-your-speech-and-suggest-music-based-on-it/

Yanofsky, D. (2018). The Quartz Guide to Bad Data. GitHub. Retrieved June 6, 2020, from https://github.com/Quartz/bad-data-guide

Moore, S. (2018). How to Create a Business Case for Data Quality Improvement. Gartner. www.gartner.com/smarterwithgartner/how-to-create-a-business-case-for-data-quality-improvement/.

IBM. (2018). Extracting Business Value From The 4 Vs of Big Data. https://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data.

Weitzel, L., Prati, R., & Aguiar, R. (2016). The Comprehension of Figurative Language: What Is the Influence of Irony and Sarcasm on NLP Techniques?. 10.1007/978-3-319-30319-2_3

Silge, J., & Robinson, D. (2022). Text Mining with R: A Tidy Approach. O’Reilly.

Mohammad, S. M. (2017). Challenges in Sentiment Analysis. A Practical Guide to Sentiment Analysis, (61-83). 10.1007/978-3-319-55394-8_4.

Roldós, I. (2020, December 23). Major Challenges of Natural Language Processing (NLP). Monkey Learn. https://monkeylearn.com/blog/natural-language-processing-challenges/

Ferjan, M. (2022, August 19). 30+ Official Listening to Music Statistics. Headphones Addict. https://headphonesaddict.com/listening-to-music-statistics/

Schwartz, D., Fischhoff, B., Krishnamurti, T., & Sowell, F. (2013). The Hawthorne Effect and Energy Awareness. Proc Natl Acad Sci U S A, 110(38), 15242–15246. 10.1073/pnas.1301687110

Davies, C., Page, B., Driesener, C. et al. (2022). The Power of Nostalgia: Age and Preference For Popular Music. Mark Lett. https://doi.org/10.1007/s11002-022-09626-7

Stephens-Davidowitz, S. (2018, February 10). Opinion: The Songs That Bind. The New York Times. https://www.nytimes.com/2018/02/10/opinion/sunday/favorite-songs.html

Stern, M. J. (2014, August 12). Neural Nostalgia: Why Do We Love The Music We Heard As Teenagers?. Slate. https://slate.com/technology/2014/08/musical-nostalgia-the-psychology-and-neuroscience-for-song-preference-and-the-reminiscence-bump.html#:~:text=And%20researchers%20have%20uncovered%20evidence,t%20weaken%20as%20we%20age.

Ong, T. (2018, February 12). Our Musical Tastes Peak As Teens, Says Study. The Verge. https://www.theverge.com/2018/2/12/17003076/spotify-data-shows-songs-teens-adult-taste-music

Stephens-Davidowitz, S. (n.d.). Research. Sethsd. http://sethsd.com/research

Xu, L., Zheng, Y., Xu, D., & Xu, L. (2021). Predicting the Preference for Sad Music: The Role of Gender, Personality, and Audio Features. IEEE Access. 1-1. 10.1109/ACCESS.2021.3090940.

Library. (n.d.). 6.2. The Evolution of Popular Music. https://open.lib.umn.edu/mediaandculture/chapter/6-2-the-evolution-of-popular-music/

Mind. (2022, April). The Link Between Money and Mental Health. https://www.mind.org.uk/information-support/tips-for-everyday-living/money-and-mental-health/the-link-between-money-and-mental-health/#mental-health-can-affect-money

AIAAIC. (n.d.). About the AIA Repository. https://www.aiaaic.org/aiaaic-repository/about-the-aiaaic-repository

Ketchell, M. (2016, July 15). Never Read The Terms and Conditions? Here’s An Idea That Might Protect Your Online Privacy. The Conversation. https://theconversation.com/never-read-the-terms-and-conditions-heres-an-idea-that-might-protect-your-online-privacy-62208

Thornhill, J. (2016, January 20). Brave New Era In Technology Needs New Ethics. Financial Times. https://www.ft.com/content/dd328bf4-a25e-11e5-8d70-42b68cfae6e4

Office of The Australian Information Commissioner. (n.d.). Australian Privacy Principles Quick Reference. https://www.oaic.gov.au/privacy/australian-privacy-principles/australian-privacy-principles-quick-reference

Schwartz, E. H. (2021, May 7). Musicians Demand Spotify Not Develop Emotional Speech Recognition Patent. Voicebot. https://voicebot.ai/2021/05/07/musicians-demand-spotify-not-develop-emotional-speech-recognition-patent/

Davie, O. (2021, October 2). Spotify Patents Tech To Monitor Your Speech, Infer Emotion. HypeBot. https://www.hypebot.com/hypebot/2021/02/spotify-patents-tech-to-monitor-your-speech-infer-emotion.html

Australian Bureau of Statistics. (2015-16). Household Expenditure Survey, Australia: Summary of Results. ABS. https://www.abs.gov.au/statistics/economy/finance/household-expenditure-survey-australia-summary-results/latest-release.


5 Appendices

5.1 Appendix A: Data sample

Songs - Group

## # A tibble: 662 × 14
##   date       title   artist genre durat…¹ release_…² lyrics ident…³ gender   age
##   <date>     <chr>   <chr>  <chr> <chr>   <date>     <chr>  <chr>   <chr>  <dbl>
## 1 2022-08-17 Heart … LANY   Alte… 03:19:… 2020-10-02 heart… Javier… Male      30
## 2 2022-08-17 A Sky … Coldp… Alte… 04:29:… 2014-05-02 a sky… Javier… Male      30
## 3 2022-08-17 Your n… Demxn… Soul  02:35:… 2018-04-07 danci… Javier… Male      30
## 4 2022-08-18 She kn… Demxn… Soul  03:40:… 2017-09-30 take … Javier… Male      30
## 5 2022-08-18 ILYSB   LANY   Alte… 04:05:… 2015-12-11 ilysb… Javier… Male      30
## 6 2022-08-18 Thick … LANY   Alte… 03:32:… 2018-09-11 thick… Javier… Male      30
## 7 2022-08-19 Heart … LANY   Alte… 03:19:… 2020-10-02 heart… Javier… Male      30
## 8 2022-08-19 Learn … Foo F… Alte… 03:56:… 1999-09-18 learn… Javier… Male      30
## # … with 654 more rows, 4 more variables: duration_min <dbl>,
## #   formatted_lyrics <chr>, score <dbl>, sentiment <chr>, and abbreviated
## #   variable names ¹​duration, ²​release_date, ³​identifier
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Expenses - Group

## # A tibble: 538 × 6
##   date       amount description type           identifier gender
##   <date>      <dbl> <chr>       <chr>          <chr>      <chr> 
## 1 2022-08-15   15   <NA>        Utilities      Arnald     Female
## 2 2022-08-16   10   <NA>        Utilities      Arnald     Female
## 3 2022-08-17    0   <NA>        No spend       Arnald     Female
## 4 2022-08-18    0   <NA>        No spend       Arnald     Female
## 5 2022-08-19   20   <NA>        Transportation Arnald     Female
## 6 2022-08-19   32.0 <NA>        Groceries      Arnald     Female
## 7 2022-08-20    0   <NA>        No spend       Arnald     Female
## 8 2022-08-21    0   <NA>        No spend       Arnald     Female
## # … with 530 more rows
## # ℹ Use `print(n = ...)` to see more rows

Moods - Group

## # A tibble: 253 × 5
##   date        mood identifier  gender mood_type    
##   <date>     <dbl> <chr>       <chr>  <chr>        
## 1 2022-08-11     6 Javier Pena Male   Surviving    
## 2 2022-08-12     6 Javier Pena Male   Surviving    
## 3 2022-08-13     7 Javier Pena Male   Blissful life
## 4 2022-08-14     8 Javier Pena Male   Blissful life
## 5 2022-08-15     6 Javier Pena Male   Surviving    
## 6 2022-08-16     7 Javier Pena Male   Blissful life
## 7 2022-08-17     6 Javier Pena Male   Surviving    
## 8 2022-08-18     7 Javier Pena Male   Blissful life
## # … with 245 more rows
## # ℹ Use `print(n = ...)` to see more rows

Song variables - Individual

## # A tibble: 215 × 6
##   date       name                                 dance…¹ energy loudn…² valence
##   <date>     <chr>                                  <dbl>  <dbl>   <dbl>   <dbl>
## 1 2022-08-11 When I Was Your Man                    0.612  0.28    -8.65   0.387
## 2 2022-08-11 Higher Love                            0.693  0.678   -7.16   0.404
## 3 2022-08-11 Dive                                   0.654  0.787   -5.00   0.409
## 4 2022-08-11 Mariposa                               0.676  0.525   -5.88   0.421
## 5 2022-08-11 Don’t Wake Me Up                       0.621  0.747   -5.08   0.426
## 6 2022-08-12 My Head & My Heart                     0.614  0.934   -3.71   0.436
## 7 2022-08-12 Wellerman - Sea Shanty / 220 KID x …   0.722  0.893   -3.26   0.439
## 8 2022-08-12 Pepas                                  0.762  0.766   -3.96   0.442
## # … with 207 more rows, and abbreviated variable names ¹​danceability, ²​loudness
## # ℹ Use `print(n = ...)` to see more rows

5.2 Appendix B: Project alignment with APP

Principle | Title | Purpose | Compliance
APP 1 | Open and transparent management of personal information | Ensures that APP entities manage personal information in an open and transparent way. This includes having a clearly expressed and up to date APP privacy policy. | Partially
APP 2 | Anonymity and pseudonymity | Requires APP entities to give individuals the option of not identifying themselves, or of using a pseudonym. Limited exceptions apply. | Yes
APP 3 | Collection of solicited personal information | Outlines when an APP entity can collect personal information that is solicited. It applies higher standards to the collection of ‘sensitive’ information. | Yes
APP 4 | Dealing with unsolicited personal information | Outlines how APP entities must deal with unsolicited personal information. | Yes
APP 5 | Notification of the collection of personal information | Outlines when and in what circumstances an APP entity that collects personal information must notify an individual of certain matters. | Yes
APP 6 | Use or disclosure of personal information | Outlines the circumstances in which an APP entity may use or disclose personal information that it holds. | Yes
APP 7 | Direct marketing | An organisation may only use or disclose personal information for direct marketing purposes if certain conditions are met. | Yes
APP 8 | Cross-border disclosure of personal information | Outlines the steps an APP entity must take to protect personal information before it is disclosed overseas. | Yes
APP 9 | Adoption, use or disclosure of government related identifiers | Outlines the limited circumstances when an organisation may adopt a government related identifier of an individual as its own identifier, or use or disclose a government related identifier of an individual. | Yes
APP 10 | Quality of personal information | An APP entity must take reasonable steps to ensure the personal information it collects is accurate, up to date and complete. An entity must also take reasonable steps to ensure the personal information it uses or discloses is accurate, up to date, complete and relevant, having regard to the purpose of the use or disclosure. | Yes
APP 11 | Security of personal information | An APP entity must take reasonable steps to protect personal information it holds from misuse, interference and loss, and from unauthorised access, modification or disclosure. An entity has obligations to destroy or de-identify personal information in certain circumstances. | Partially
APP 12 | Access to personal information | Outlines an APP entity’s obligations when an individual requests to be given access to personal information held about them by the entity. This includes a requirement to provide access unless a specific exception applies. | Yes
APP 13 | Correction of personal information | Outlines an APP entity’s obligations in relation to correcting the personal information it holds about individuals. | Yes

5.3 Appendix C: ODI Data Ethics Canvas

5.4 Appendix D:

type
Utilities
Groceries
Transportation
Transport
Medical

5.5 Appendix E:

## # A tibble: 2 × 10
##   date       title      artist genre durat…¹ relea…² lyrics ident…³ gender   age
##   <date>     <chr>      <chr>  <chr> <time>  <chr>   <chr>  <chr>   <chr>  <dbl>
## 1 0202-08-17 Reconfigu… Other… Indie 03:27   04/05/… i won… Thyme   Male      27
## 2 0202-08-17 Basket Ca… Green… Punk… 03:01   29/08/… baske… Eren    Male      24
## # … with abbreviated variable names ¹​duration, ²​release_date, ³​identifier

5.6 Appendix F:

date | title | artist | genre | duration | release_date | identifier | gender | age
2022-09-04 | The Ringer | Eminem | HipHop | 05:37:00 | 2108-08-31 | Eren | Male | 24